INTRODUCTION

This project is trying to address the question “Is there a significant difference in income between men and women? Does the difference vary depending on other factors such as education, marital status, criminal history, drug use, childhood household factors, profession, etc.”

We use the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set. The National Longitudinal Surveys (NLS) are a set of surveys designed to gather information at multiple points in time on the labor market activities and other significant life events of several groups of men and women. For more than four decades, NLS data have served as an important tool for economists, sociologists, and other researchers. The NLSY97 data set contains survey responses from thousands of individuals who have been surveyed every one or two years since 1997.

Loading the data from “nlsy97_income.csv” gives 8984 observations of 79 variables.

From these 79 variables, I selected 13 for further analysis that I hypothesised would have the greatest impact on the income gap.

DATA SUMMARY

Renaming the Variables

The data set does not come with descriptive variable names, so I renamed the variables to make the columns easier to read.

Below is the list of selected variables and their definitions:

Variable                     Description
totalincarceration           Total number of incarcerations reported by the respondent
gender                       Gender of the respondent (male/female)
physicalEmotionalCondition   Respondent’s physical or emotional condition that limits school/work
race                         Race of the respondent
biologicalChild              Number of biological children born and residing in the household
collegeType                  Respondent’s school information (public, private not-for-profit, private for-profit, or did not attend college)
familyIncome                 Gross family income in the previous year
drugUse                      Respondent used hard drugs since DLI (date of last interview)
industry                     Type of industry or business
income                       Income received by the respondent last year
maritalStatus                Whether the respondent is married, inferred from whether a spouse’s income was reported
Project Hypothesis
  1. Total Number of Incarcerations
    • The total number of incarcerations may have a greater impact on women than on men. Male-dominated occupations such as construction and mining (heavy-labor work) may be less affected by a history of incarceration, while female-dominated occupations such as nursing and nannying may be strongly affected. This could make the total number of incarcerations a factor in the income gap between men and women.
  2. Physical and Emotional Conditions during school/work
    • Physical and emotional conditions may have a greater impact on income for women than for men. Women are often perceived as the weaker gender, and according to a study on socioeconomic status (SES), women have a 20% higher chance of depression and socioeconomic troubles that are not addressed or recorded. This might lead to women earning less than men.
  3. Race
    • Race may have a greater impact on income for women than for men. According to a study by AAUW, race has an effect on the gender wage gap. These effects may be due to the uneven distribution of women by race.
  4. Number of Children at Home
    • The number of children may have a greater impact on women than on men. In a society that is still patriarchal, men are considered the breadwinners and women the homemakers. More children at home may lead women to stay at home or take part-time jobs that pay less.
  5. Type of College
    • Type of college may have a greater impact on women than on men. According to a study by AAUW, the gender wage gap is affected by the type of college. A private college has better opportunities for women than a non-profit college.
  6. Family Income
    • Family income may have a greater impact on income for women than for men. In a male-dominated society, the man is generally considered to contribute the larger share of the family income. In some households, the woman works only if the husband’s salary is not enough to make ends meet.
  7. Drug Use
    • Drug use may have a greater impact on women than on men. Male-dominated occupations such as construction and mining (heavy-labor work) may be less affected by drug use, while female-dominated occupations such as nursing and nannying may be strongly affected. This could make drug use a factor in the income gap between men and women.
  8. Industry
    • Industry should have an impact on the gender wage gap. Some occupations, such as construction and mining (heavy-labor work), are male dominated and pay men better, while others, such as nursing and nannying, are female dominated and pay women better.
  9. Marital Status
    • Marital status may have a greater impact on women than on men. According to one study, married men earn more than unmarried men, while unmarried women earn more than married women. Marital status thus has opposite effects for the two genders.
Renaming the factors

All the factors are initially represented as integers. I used the transform() and mapvalues() functions to convert these variables to factors and give the factors more meaningful levels.

'data.frame':   8984 obs. of  13 variables:
 $ totalincarceration        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gender                    : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 2 2 ...
 $ physicalEmotionalCondition: Factor w/ 5 levels "DontKnow","No",..: 2 4 2 5 2 2 2 2 2 2 ...
 $ race                      : Factor w/ 4 levels "Black","Hispanic",..: 4 2 2 2 2 2 2 4 4 4 ...
 $ biologicalChild           : int  -4 -4 -5 2 1 1 -4 -5 -4 -5 ...
 $ collegeType               : Factor w/ 6 levels "Invalid","NotInterviewed",..: 6 6 6 6 6 6 6 6 3 2 ...
 $ familyIncome              : int  50000 81000 150250 -3 130000 55000 14766 66750 110000 -5 ...
 $ highestDegree             : Factor w/ 10 levels "Associate/Junior college (AA)",..: 2 4 1 4 4 4 3 6 6 8 ...
 $ drugUse                   : Factor w/ 6 levels "DontKnow","No",..: 2 2 2 2 2 2 2 2 2 3 ...
 $ industry                  : Factor w/ 20 levels "ACS SPECIAL CODES",..: 20 15 14 5 15 5 19 20 5 12 ...
 $ income                    : int  70000 83000 -2 29000 76000 15000 -5 -5 54000 -5 ...
 $ estimatedIncome           : num  -4 -4 37500 -4 -4 -4 -5 -5 -4 -5 ...
 $ maritalStatus             : Factor w/ 6 levels "DontKnow","No",..: 5 5 6 6 6 6 3 3 6 3 ...
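
The recoding step described above can be sketched as follows. This is a minimal illustration on toy data, not the code used for the report: it uses base R's factor() instead of mapvalues() so that it runs without extra packages, and the integer-to-label mapping shown here is an assumption that should be checked against the NLSY97 codebook.

```r
# Toy data standing in for two integer-coded columns (values are illustrative).
raw <- data.frame(gender = c(2L, 1L, 2L, 2L, 1L),
                  race   = c(4L, 2L, 2L, 1L, 3L))

# Assumed code-to-label mapping; verify against the codebook before use.
recoded <- transform(
  raw,
  gender = factor(gender, levels = c(1, 2), labels = c("male", "female")),
  race   = factor(race, levels = 1:4,
                  labels = c("Black", "Hispanic", "Mixed", "Other"))
)

str(recoded)
```

Listing the levels explicitly means that any code not named in `levels` becomes NA rather than silently forming its own level.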
Exploring the DataSet

We start by comparing income between males and females using a simple box plot:

OBSERVATION: The above plot suggests that men earn more than women.

The graph also shows many outliers, which might influence our interpretation of the data set. The income data is positively skewed, which might also affect our inferences.

To get a better understanding of the data set, we plot the data distribution using a Q-Q plot.

From this we can see that the data is skewed at the edges. Many data points are recorded as 0 income, and the top 2 percent of incomes are top coded to the average value of the top 2 percent of earners, i.e. 180331. We need to clean the data set before moving forward with the analysis, as dirty data will give us incorrect results.

DATA CLEANING

The data taken from NLSY97 is messy and has several issues that need to be addressed before performing any further analysis.

PROBLEM 1

In every attribute, negative values are codes for missing responses rather than real measurements:

Coded Value   Description
-1            Refused to answer
-2            Don’t know
-3            Invalid skip (data not retrieved/lost)
-4            Valid skip (question not relevant to the respondent)
-5            Respondent not interviewed that year
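
As a minimal sketch of how such codes can be turned into missing values, the helper below replaces a chosen set of codes with NA. The function name `codes_to_na` and the toy vector are hypothetical; which codes to convert differs by variable, so the set is passed in explicitly.

```r
# Survey codes for non-substantive answers, as listed in the table above.
missing_codes <- c(refusal = -1, dont_know = -2, invalid_skip = -3,
                   valid_skip = -4, non_interview = -5)

# Hypothetical helper: replace the chosen codes in x with NA.
codes_to_na <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

# Toy family income values, including an invalid skip and a non-interview.
family_income <- c(50000, 81000, 150250, -3, 130000, -5)
cleaned <- codes_to_na(family_income,
                       missing_codes[c("invalid_skip", "non_interview")])
cleaned
```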

PROBLEM 2

The top 2 percent of income values are top coded to the average value of the top 2 percent of cases, i.e. 180331, because the survey masks very high incomes to protect respondents’ confidentiality. This results in a skewed data set, as seen in the data distribution above.

Handling Missing Data

To handle values that are missing for different reasons, I changed some codes to Not Available (NA) and imputed others, depending on the attribute. A dash means the code needs no special handling for that variable.

Variable                     Refusal      Don’t Know   Invalid Skip   Valid Skip       Non-Interview
income                       Midpoint*    Midpoint*    -              0                NA
totalincarceration           -            -            -              -                -
gender                       -            -            -              -                -
physicalEmotionalCondition   NA           NA           -              NA               -
race                         -            -            -              -                -
biologicalChild              -            -            NA             NA               NA
collegeType                  -            -            NA             Did not Attend   NA
familyIncome                 -            -            NA             -                NA
highestDegree                -            -            NA             NA               NA
drugUse                      Used Drugs   Used Drugs   NA             -                NA
industry                     -            -            NA             Not Working      NA
maritalStatus                NA           NA           -              Not Married      NA
* Changed to the middle value of the estimated income range, if available.
  • Income
    • Income is imputed from the estimated income for respondents who refused to answer or did not know their income but provided an estimated income range.
    • For respondents for whom the question was not valid, income is set to 0, as they are not working. (All 0-value incomes are removed in a later step.)

As some respondents gave an estimated income range rather than their true income, the middle value of that range can be used to estimate their income. This improves the data set by reducing the number of missing values. The estimates are not exact, but they give a good approximation of income.

As we are removing the top 2 percent of income observations to reduce skewness, it is also important to remove the observations with 0 income: they do not give us any valid information, and keeping them would bias the data set towards low income values.
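
The income rules above can be sketched on toy data as follows. The function `impute_income` is hypothetical, and the sketch makes two assumptions that the text does not state: that `estimatedIncome` already holds the midpoint of the reported range, and that a refusal or "don't know" with no estimate becomes NA.

```r
# Toy rows: negative values are survey codes (-1 refusal, -2 don't know,
# -4 valid skip, -5 non-interview).
d <- data.frame(income          = c(70000,    -2,   -1, -4, -5, -2),
                estimatedIncome = c(   -4, 37500, 7500, -4, -5, -4))

impute_income <- function(income, estimated) {
  out <- income
  refused_or_dk <- income %in% c(-1, -2)
  has_estimate  <- estimated >= 0
  use_estimate  <- refused_or_dk & has_estimate
  out[use_estimate] <- estimated[use_estimate]   # midpoint of estimated range
  out[refused_or_dk & !has_estimate] <- NA       # assumption: no estimate -> NA
  out[income == -4] <- 0                         # valid skip: treated as not working
  out[income == -5] <- NA                        # not interviewed that year
  out
}

d$income_clean <- impute_income(d$income, d$estimatedIncome)
d
```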

  • College Type
    • Respondents for whom the question was not valid are assumed not to have attended college. (A new category “Not Attend College” is added.)
  • Industry
    • Respondents for whom the question was not valid are assumed not to be working. (A new category “Not Working” is added.)
  • Marital Status
    • Respondents for whom the question was not valid are considered not married. (A new category “Not Married” is added.)
    • Respondents who answered yes or no are considered married. (A new category “Married” is added.)

Using the survey question on how much the spouse earns, we can infer whether the respondent is married. This lets us check whether marital status affects the gender income gap.

Handling Top Coded Values

The top 2% of income observations are top coded to the average value of the top 2 percent of cases, which skews the data set and misrepresents it. Our observations would be heavily influenced by the top-coded values, reducing the accuracy of the model. For these reasons, the top-coded observations are removed from the data set.

Also, to avoid a biased data set, all observations with 0 income are changed to Not Available (NA). Similarly, for family income, all top-coded observations are changed to Not Available (NA).
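
A minimal sketch of this step on a toy vector is shown below. The top-coded value 180331 comes from the text; the variable names and the other values are illustrative only.

```r
top_code <- 180331   # top-coded income value quoted in the text

income <- c(29000, 0, 180331, 54000, NA, 180331, 15000)

# Set top-coded and zero incomes to NA so they drop out of later summaries.
# The is.na() guard keeps values that are already missing out of the comparison.
drop_me <- !is.na(income) & (income == top_code | income == 0)
income_clean <- income
income_clean[drop_me] <- NA

income_clean
```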

DATA SUMMARY AFTER CLEANING THE DATA

Summary of all the variables considered in the model after cleaning (numeric and factor variables):

FACTOR VARIABLES

gender
  female                                            2772
  male                                              2914

physicalEmotionalCondition
  No                                                4736
  Yes                                                315
  NA’s                                               635

race
  Black                                             1486
  Hispanic                                          1224
  Mixed                                               50
  Other                                             2926

collegeType
  Private for Non Profit                             178
  Private for Profit                                 211
  Public                                             733
  Did Not Attend                                    4278
  NA’s                                               286

highestDegree
  High school diploma (Regular 12 year program)     2398
  Bachelor’s degree (BA, BS)                        1228
  GED                                                594
  Associate/Junior college (AA)                      422
  None                                               402
  Master’s degree (MA, MS)                           298
  (Other)                                            344

drugUse
  Yes                                                213
  No                                                5215
  NA’s                                               258

industry
  EDUCATIONAL, HEALTH, AND SOCIAL SERVICES          1186
  PROFESSIONAL AND RELATED SERVICES                  643
  ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES    569
  RETAIL TRADE                                       558
  MANUFACTURING                                      383
  (Other)                                           1993
  NA’s                                               354

maritalStatus
  Married                                           3332
  Not Married                                       2314
  NA’s                                                40

NUMERIC VARIABLES

          income   totalincarceration   familyIncome
Min.          -2               0.0000              0
1st Qu.    17500               0.0000          30000
Median     30000               0.0000          52000
Mean       34106               0.1338          59394
3rd Qu.    47000               0.0000          81920
Max.      111131               9.0000         220250
NA’s                                             946

After cleaning the data, we look at the distribution again. The number of outliers has fallen, and the distribution is closer to normal than that of the uncleaned data. This will help us build a better predictive model.

After cleaning the data, we again compare income between men and women:

OBSERVATION: The above plot still suggests that men earn more than women.

The data set is still slightly skewed at the edges and still has some outliers, but we cannot remove these observations, as they contribute significantly to the data set. Removing the outliers would not give correct results for the regression model.

The remaining skewness arises because income is not symmetric; it is positively skewed.

To dig deeper into the data set, we compute some general statistics for income by gender.

gender mean sd se
female 30413.03 20847.49 395.9654
male 37619.46 23858.33 441.9725

From the above table, we draw the same inference: on average, men (mean = 37619.46) earn more than women (mean = 30413.03). This suggests that there is an income gap between men and women. To check the significance of this difference, we run a t-test and a Wilcoxon rank-sum test.
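
A table of this kind (mean, standard deviation and standard error by gender) can be produced with a few lines of base R. The sketch below uses a small toy data frame and a hypothetical helper `summarise_income`; it is not the code behind the table above.

```r
# Toy data; the incomes are illustrative and include one missing value.
d <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male", "male"),
  income = c(29000, 15000, NA, 83000, 76000, 54000, 70000)
)

# Mean, standard deviation and standard error of the mean, ignoring NAs.
summarise_income <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x), se = sd(x) / sqrt(length(x)))
}

# One row per gender, one column per statistic.
stats <- t(sapply(split(d$income, d$gender), summarise_income))
stats
```

The standard error is the standard deviation divided by the square root of the number of non-missing observations, which is why the missing values are dropped before counting.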

T Test for Income vs Gender

To test the significance of the results from the above table and plots, we perform a t-test.


    Welch Two Sample t-test

data:  income by gender
t = -12.144, df = 5643.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8369.73 -6043.13
sample estimates:
mean in group female   mean in group male 
            30413.03             37619.46 

OBSERVATION: The t-test suggests that the results given above are significant.

The t-test results suggest that income is on average about $7,206 higher for men than for women (t-statistic -12.14, p < 2.2e-16, 95% CI [-8369.7, -6043.1] for the female minus male difference). The p-value is far below 0.05, so we can confirm that the difference in income by gender is statistically significant. It supports our hypothesis that men make more money than women.
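
As a quick arithmetic check, the Welch t-statistic can be recomputed from the group means and standard errors in the summary table, since the statistic is the difference in means divided by the square root of the sum of the squared standard errors. The variable names below are mine; the numbers are copied from the table.

```r
# Group means and standard errors copied from the income-by-gender table.
mean_f <- 30413.03; se_f <- 395.9654
mean_m <- 37619.46; se_m <- 441.9725

diff_means <- mean_f - mean_m          # female minus male, as in the t-test output
se_diff    <- sqrt(se_f^2 + se_m^2)    # Welch standard error of the difference
t_stat     <- diff_means / se_diff

c(difference = diff_means, se = se_diff, t = t_stat)
```

The result should agree with the reported t = -12.144 up to rounding, which confirms that the table and the test output describe the same comparison.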

As the data is not completely normal at the edges, I also performed a Wilcoxon rank-sum test to check the significance of the result (the Wilcoxon rank-sum test does not assume that the data is normally distributed).

Wilcoxon Test for Income vs Gender

    Wilcoxon rank sum test with continuity correction

data:  income by gender
W = 3328400, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -8000 -5001
sample estimates:
difference in location 
                 -7000 

OBSERVATION: The Wilcoxon test also supports the results produced above.
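
For reference, both tests can be run with the formula interface in base R. The sketch below uses simulated, right-skewed incomes (the distribution parameters are arbitrary assumptions), so its numbers will not match the output above.

```r
set.seed(42)

# Simulated right-skewed incomes for two groups (illustrative only).
d <- data.frame(
  gender = factor(rep(c("female", "male"), each = 200)),
  income = c(rlnorm(200, meanlog = log(25000), sdlog = 0.7),
             rlnorm(200, meanlog = log(32000), sdlog = 0.7))
)

# Welch two-sample t-test (the default: variances are not assumed equal).
tt <- t.test(income ~ gender, data = d)

# Wilcoxon rank-sum test; conf.int = TRUE also returns the estimated
# location shift and its confidence interval.
wt <- wilcox.test(income ~ gender, data = d, conf.int = TRUE)

tt$estimate
wt$estimate
wt$conf.int
```

With a two-level factor, both functions report the first level minus the second, which is why the intervals above are negative when men earn more.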

However, we should also take into account other factors that might affect income and that have not yet been considered in the discussion of the gender income gap.

We now consider other variables to check their impact on income and the income gap:

EXPLORING OTHER VARIABLES

1. Total Incarcerations

Data summary for total number of incarcerations against Income Gap

totalincarceration income.gap
0 8637.189
1 9199.310
2 9281.953
3 2555.756
4 -18862.682
5 NaN
6 NaN
7 9930.000
8 NaN
9 NaN

We start with the effect of total incarcerations on income. Income is correlated with the total number of incarcerations, with a correlation coefficient of about -0.12. This indicates a weak negative linear association: respondents with more incarcerations tend to have lower incomes. Note that a correlation coefficient is unitless and does not translate into a percentage change in income per incarceration.

Now, to check the same relation for the income gap, we use a box plot of income against total incarcerations for men and women.

From the above graph, we observe that income falls as the number of incarcerations rises, but the pattern is not significantly different for the two genders. Also, the comparison is not available for a few incarceration counts, and the data has many outliers, which makes the results for this variable less reliable.

From the left graph we can observe the effect of the total number of incarcerations on income when the same effect is assumed for both genders: as the number of incarcerations increases, income falls.

From the right graph we can observe the effect of the total number of incarcerations on income separately for each gender. For both genders the association with income is negative and almost the same (slightly higher for females than for males).

We can also confirm this using the correlation between income and total incarcerations for each gender. The correlation for men is -0.184 and for women is -0.091. This tells us that the association between the number of incarcerations and income depends only loosely on gender. For both males and females, there is a negative association between the number of incarcerations and income (more incarcerations go with lower income).
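
Correlations by gender can be computed by splitting the data and applying cor() to each part. Because a correlation coefficient is unitless, the sketch below also fits a simple regression per gender to get the slope, which is the quantity that says how much income changes per additional incarceration. The data are toy values, not NLSY97 records.

```r
d <- data.frame(
  gender             = rep(c("female", "male"), each = 6),
  totalincarceration = c(0, 0, 0, 1, 1, 2,   0, 0, 1, 1, 2, 3),
  income             = c(32000, 28000, 30000, 27000, 25000, 24000,
                         41000, 38000, 33000, 30000, 26000, 20000)
)

by_gender <- split(d, d$gender)

# Correlation between income and incarcerations within each gender.
r_by_gender <- sapply(by_gender, function(g)
  cor(g$income, g$totalincarceration, use = "complete.obs"))

# Regression slope within each gender: change in income per incarceration.
slope_by_gender <- sapply(by_gender, function(g)
  coef(lm(income ~ totalincarceration, data = g))[["totalincarceration"]])

rbind(correlation = r_by_gender, slope = slope_by_gender)
```

The two quantities are linked by slope = r × sd(income) / sd(incarcerations), so the same correlation can correspond to very different changes in income.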

Finally, to check the significance of these results, we use an ANOVA test.

                            Df        Sum Sq     Mean Sq F value Pr(>F)
gender                       1   73776198960 73776198960 149.919 <2e-16
totalincarceration           1   66129372004 66129372004 134.380 <2e-16
gender:totalincarceration    1     179606366   179606366   0.365  0.546
Residuals                 5682 2796154489170   492107443               
                             
gender                    ***
totalincarceration        ***
gender:totalincarceration    
Residuals                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result: The interaction between gender and total incarcerations has a p-value of 0.546, which is above the 0.05 level and therefore not significant. The data suggest an association between income and total incarcerations, but not between the income gap and total incarcerations. Contrary to our hypothesis, the total number of incarcerations does not affect the income gap between males and females.

Reject total incarcerations factor from income gap model
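
The same kind of test is used for every variable in this section: an ANOVA of income on gender, the variable, and their interaction, where the interaction row is the one that speaks to the income gap. A minimal sketch on simulated data follows; the effect sizes are arbitrary assumptions, so the output will not match the table above.

```r
set.seed(7)
n <- 400

# Simulated data with a gender effect and an incarceration effect,
# but no built-in interaction between the two.
d <- data.frame(
  gender             = factor(sample(c("female", "male"), n, replace = TRUE)),
  totalincarceration = rpois(n, 0.2)
)
d$income <- 30000 + 7000 * (d$gender == "male") -
  3000 * d$totalincarceration + rnorm(n, sd = 20000)

# gender * x expands to gender + x + gender:x
fit <- aov(income ~ gender * totalincarceration, data = d)
tab <- summary(fit)[[1]]
rownames(tab) <- trimws(rownames(tab))   # summary() pads row names with spaces

tab
interaction_p <- tab["gender:totalincarceration", "Pr(>F)"]
interaction_p
```

Each F value in the table is that row's mean square divided by the residual mean square; for example, 179606366 / 492107443 gives the 0.365 reported for the interaction above.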

2. Physical and Emotional Condition at School/Work

Data summary for physical and Emotional Condition against Income Gap

physicalEmotionalCondition income.gap
No 7303.051
Yes 5106.794

To check whether physical and emotional condition affects income for men and women, we plot bar graphs of the average incomes for both genders:

The first graph suggests that a physical or emotional condition that limits school or work does have an effect on the respondent’s income. But the second graph shows that it has little impact on the income gap between men and women.

The variance of the income gap for the “yes” response is very high, while the difference in income gap between “yes” and “no” is small, which makes the variable insignificant for the income gap model.

We can support this analysis with an ANOVA test of income on gender and physical-emotional condition.

                                    Df        Sum Sq     Mean Sq F value
gender                               1   62497570350 62497570350 126.139
physicalEmotionalCondition           1   23570462845 23570462845  47.572
gender:physicalEmotionalCondition    1     349334444   349334444   0.705
Residuals                         5047 2500621017323   495466815        
                                    Pr(>F)    
gender                             < 2e-16 ***
physicalEmotionalCondition        5.95e-12 ***
gender:physicalEmotionalCondition    0.401    
Residuals                                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
635 observations deleted due to missingness

Result: The interaction between gender and physical-emotional condition has a p-value of 0.401, which is above the 0.05 level and therefore not significant. The data suggest an association between income and physical-emotional condition, but not between the income gap and physical-emotional condition. Contrary to our hypothesis, physical and emotional condition has the same effect on male and female incomes.

Reject physical-emotional condition factor from income gap model

3. Race

Data summary for race against Income Gap

race income.gap
Black 3401.801
Hispanic 8718.042
Mixed 6227.741
Other 7568.899

To check whether race affects income for men and women, we plot bar graphs of the average incomes for both genders:

From the above graphs we notice that race has an impact on the income gap between males and females. The gap is largest for Hispanic respondents and smallest for Black respondents. The income gap for mixed race can be left out of our conclusion, as its variance is very high. To confirm our analysis, we perform an ANOVA test of income on gender and race.

              Df        Sum Sq     Mean Sq F value  Pr(>F)    
gender         1   73776198960 73776198960 152.277 < 2e-16 ***
race           3  105733716539 35244572180  72.746 < 2e-16 ***
gender:race    3    5810933956  1936977985   3.998 0.00744 ** 
Residuals   5678 2750918817046   484487287                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result: The interaction between gender and race has a p-value of 0.007, which is below the 0.05 level and therefore significant. The data suggest an association between the income gap and race. The results support our hypothesis that race has an effect on the income gap between men and women.

Accept race for income gap model

4. Number of Biological Children at Home

Data summary for Number of Children at home against Income Gap

biologicalChild income.gap
0 4584.192
1 11805.627
2 15905.450
3 25452.801
4 25101.957
5 12187.550
6 NaN

We start with the effect of the number of children on income. Income is correlated with the number of children at home, with a correlation coefficient of about -0.05. This indicates a very weak negative linear association; a correlation coefficient is unitless and does not translate into a percentage change in income per child.

Now, to check the same relation for the income gap, we use a box plot of income against number of children for men and women.

From the above graph, we notice a big gap in median income between males and females. Men earn much more than women when they have children at home, probably because women have to take care of the children at home and so cannot take up full-time jobs. The difference in income between the two genders is clearly visible.

From the left graph we can observe the effect of the number of children on income when the same effect is assumed for both genders: as the number of children increases, income falls.

From the right graph we can observe that the number of children has opposite effects for the two genders. For women the association with income is negative, as women need to take care of the children at home, while for men the association is positive, as they need to earn more to support the family.

I found the opposite correlations between number of children and income for men and women interesting.

The correlation for men is 0.231 and for women is -0.189. This tells us that the association between the number of children and income seems to depend on gender. Among females, the association is negative (with more children in the house, females earn less), while among males it is positive (males earn more if there are more children at home).

We can confirm our results with an ANOVA test of income on number of children and gender.

                         Df        Sum Sq     Mean Sq F value Pr(>F)    
gender                    1   94327983656 94327983656 223.062 <2e-16 ***
biologicalChild           1    1088353752  1088353752   2.574  0.109    
gender:biologicalChild    1   55192158147 55192158147 130.516 <2e-16 ***
Residuals              2766 1169680891635   422878124                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result: The interaction between gender and number of children has a p-value below 2e-16, which is significant at the 0.05 level, so the data suggest that there is an association between income, gender and number of children.

Accept number of children factor for income gap model

5. College Type

Data summary for College Type against Income Gap

collegeType income.gap
Private for Non Profit 10825.860
Private for Profit 6863.177
Public 8076.693
Did Not Attend 7033.632

To check whether college type affects income for men and women, we plot bar graphs of the average incomes for both genders:

From the above graphs, we observe that the income gap is almost the same for all types of college. The gap is slightly larger for the “Private for Non Profit” college type, but the variance for that type is also very high (see the error bars in the income gap graph). With such high variance, the difference is not significant. For these reasons we can drop the college type variable from our income gap model.

We can support this analysis with an ANOVA test of income on gender and college type.

                     Df        Sum Sq     Mean Sq F value  Pr(>F)    
gender                1   73422079347 73422079347 146.786 < 2e-16 ***
collegeType           3   20840428700  6946809567  13.888 5.1e-09 ***
gender:collegeType    3     723213834   241071278   0.482   0.695    
Residuals          5392 2697073270722   500199049                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
286 observations deleted due to missingness

Result: The interaction between gender and college type has a p-value of 0.695, which is above the 0.05 level and therefore not significant. Contrary to our hypothesis, college type has the same effect on male and female incomes.

Reject college type from income gap model

6. Drug Use

Data summary for Drug Use against Income Gap

drugUse income.gap
Yes 5110.492
No 7578.205

To check whether drug use affects income for men and women, we plot bar graphs of the average incomes for both genders:

From the above graph, we can observe that the variance for the “yes” response is very high, which makes the “yes” response insignificant and cancels out the apparent difference in income gap between men and women.

We can support this analysis with an ANOVA test of income on gender and drug use.

                 Df        Sum Sq     Mean Sq F value  Pr(>F)    
gender            1   74256771082 74256771082 147.780 < 2e-16 ***
drugUse           1    5849146011  5849146011  11.641 0.00065 ***
gender:drugUse    1     295670806   295670806   0.588 0.44306    
Residuals      5424 2725461157412   502481777                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
258 observations deleted due to missingness

Result: The interaction between gender and drug use has a p-value of 0.443, which is above the 0.05 level and therefore not significant. Contrary to our hypothesis, drug use has the same effect on male and female incomes.

Reject drug use from income gap model

7. Industry

Data summary for Industry against Income Gap

industry income.gap
ACS SPECIAL CODES -10116.667
ACTIVE DUTY MILITARY 13399.900
AGRICULTURE, FORESTRY AND FISHERIES 21194.571
CONSTRUCTION 8720.578
EDUCATIONAL, HEALTH, AND SOCIAL SERVICES 6867.232
ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES 4307.533
FINANCE, INSURANCE, AND REAL ESTATE 7054.947
INFORMATION AND COMMUNICATION 4440.687
MANUFACTURING 4758.557
MINING 3780.067
OTHER SERVICES 5496.095
PROFESSIONAL AND RELATED SERVICES 2194.134
PUBLIC ADMINISTRATION 14443.163
RETAIL TRADE 4449.122
TRANSPORTATION AND WAREHOUSING 13037.409
UTILITIES 35126.383
Not Working 3280.049
WHOLESALE TRADE 7655.057

To check whether industry affects income for men and women, we plot bar graphs of the average incomes for both genders:

From the bar charts above, we can see that industry affects both men and women. For example, under ACS Special Codes females earn more than males, while in all the other industries males tend to earn a higher income. For some industries, such as information and communication and construction, the variance is very high, which makes the industry variable less reliable for our model.

As hypothesised, industry does make a difference, because some industries are male dominated while others are female dominated. I also had to remove some industries from my analysis, such as the military, where the sample was unevenly distributed, with 20 men and 1 woman. Such a sample does not provide any useful information. With multiple industries behaving differently, it is difficult to decide whether to keep the variable. To understand the relation between industry and the income gap, we use an ANOVA test of income on gender and industry.

                  Df        Sum Sq     Mean Sq F value Pr(>F)    
gender             1   67992597522 67992597522 152.671 <2e-16 ***
industry          17  313329119187 18431124658  41.385 <2e-16 ***
gender:industry   17   17323618723  1019036395   2.288 0.0019 ** 
Residuals       5296 2358599928647   445354971                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
354 observations deleted due to missingness

Result: The interaction between gender and industry has a p-value of 0.002, which is below the 0.05 level and therefore significant. Supporting our hypothesis, industry has an impact on the income gap between men and women.

Accept industry for income gap model

8. Family Income

We start by checking the relation between family income and income using scatter plots.

The graph is densely populated and shows a positive association between income and family income. Since the respondent’s income is part of the total family income, we need to check the correlation between income and family income. If they are highly correlated, we need to remove the variable from our model, as it would give misleading results for its impact on income.

The above matrix suggests that income is moderately correlated with family income, with a correlation coefficient of 0.41. We can infer that, even though family income might be significant, its effect might be exaggerated because of this correlation.
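
A correlation matrix of this kind can be computed with cor(). Because family income has many missing values after cleaning, the sketch below uses `use = "pairwise.complete.obs"` so that each pair of columns is compared on the rows where both are present; the data are toy values and the choice of option is my assumption.

```r
# Toy data with missing values in both columns (illustrative only).
d <- data.frame(
  income       = c(29000, 76000, 15000, 54000, NA, 42000),
  familyIncome = c(NA, 130000, 55000, 110000, 66750, 81000)
)

# Pairwise-complete correlation matrix of the two numeric columns.
cor_mat <- cor(d, use = "pairwise.complete.obs")
round(cor_mat, 2)
```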

From the left graph we can observe the effect of family income on income when the same effect is assumed for both genders (an increase in family income goes with an increase in income).

For the right graph we allow a different effect for males and females, though the graph shows almost the same effect of family income on both men and women.

Looking at the graphs above, we observe that family income does have an impact on income and the income gap. The correlation for men is 0.429 and for women is 0.422. For both genders, there is a positive association between income and family income (more family income goes with more income). But we must also keep in mind the correlation between income and family income from the above matrix.

This suggests that, even though family income is significant for our model, because of this correlation it does not contribute to the model as much as the results depict.

We can support this analysis with an ANOVA test of income on gender and family income.

                      Df        Sum Sq      Mean Sq F value   Pr(>F)    
gender                 1   78671277907  78671277907   194.5  < 2e-16 ***
familyIncome           1  419408434310 419408434310  1037.0  < 2e-16 ***
gender:familyIncome    1    5015972393   5015972393    12.4 0.000433 ***
Residuals           4736 1915468049549    404448490                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Family income is not causal. This is the old adage that correlation does not imply causation. In this example, we have strong evidence that higher family income is positively associated with the respondent’s income. This doesn’t mean that decreasing family income will lower the respondent’s income. The relationship is not causal, at least not in that direction. A more reasonable explanation is that higher income leads to higher family income.

Result: The interaction between gender and family income has a p-value of 0.000433, which is below the 0.05 level and therefore significant. There is some correlation between income and family income, but the result supports our hypothesis that family income has an impact on the income gap between men and women.

Accept family income for income gap model

9. Marital Status

Data summary for Marital Status against Income Gap

maritalStatus  income.gap
Married         10696.111
Not Married      2582.938

To check whether marital status impacts the income of men and women, we plot bar graphs of the average incomes for both genders:

Looking at the graphs above, we observe that marital status has a very different impact on men and women. The correlation value for men is 0.231 and for women is -0.189. This tells us that the association between marital status and income seems to depend on gender. Among females, this association is negative (married women earn less than unmarried women), while among males the association is positive (married men earn more than unmarried men).

To check the significance of the results above, we use an ANOVA test:

##                        Df        Sum Sq     Mean Sq F value   Pr(>F)    
## gender                  1   73808131692 73808131692  151.20  < 2e-16 ***
## maritalStatus           1   59363983695 59363983695  121.61  < 2e-16 ***
## gender:maritalStatus    1   22436194453 22436194453   45.96 1.33e-11 ***
## Residuals            5642 2754127161425   488147317                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 40 observations deleted due to missingness

Result: The p-values for gender, marital status and their interaction (< 2e-16, < 2e-16 and 1.33e-11) are all below the 0.05 level, so all three terms are significant. This supports our hypothesis that marital status has an impact on the income gap between men and women.

Accept marital status for income gap model

LINEAR REGRESSION MODEL

After checking all the variables individually against the income gap using the ANOVA test, we have come down to 5 variables.

All these variables had a significant effect on the income gap between males and females when we considered each variable separately.

We now build a regression model on these variables, along with gender, to understand the income gap between men and women:

lm(income ~ gender + race + biologicalChild + familyIncome + industry + maritalStatus, data = nyse_top2removed)

                                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                9543.684  10547.731   0.905   0.3657
gendermale                                                12128.720    900.815  13.464   0.0000
raceHispanic                                               1860.473   1057.157   1.760   0.0786
raceMixed                                                  5348.945   4195.600   1.275   0.2025
raceOther                                                  3389.917    951.958   3.561   0.0004
biologicalChild                                             606.428    380.801   1.593   0.1114
familyIncome                                                  0.204      0.011  18.542   0.0000
industryACTIVE DUTY MILITARY                              18168.632  11872.683   1.530   0.1261
industryAGRICULTURE, FORESTRY AND FISHERIES                 785.957  11248.101   0.070   0.9443
industryCONSTRUCTION                                       4264.167  10520.911   0.405   0.6853
industryEDUCATIONAL, HEALTH, AND SOCIAL SERVICES           4466.955  10433.196   0.428   0.6686
industryENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -4292.787  10471.265  -0.410   0.6819
industryFINANCE, INSURANCE, AND REAL ESTATE                9820.070  10528.107   0.933   0.3511
industryINFORMATION AND COMMUNICATION                      3260.241  10906.055   0.299   0.7650
industryMANUFACTURING                                      4918.710  10503.670   0.468   0.6396
industryMINING                                            22045.524  11637.037   1.894   0.0583
industryOTHER SERVICES                                    -1664.867  10530.469  -0.158   0.8744
industryPROFESSIONAL AND RELATED SERVICES                   811.216  10476.694   0.077   0.9383
industryPUBLIC ADMINISTRATION                             13775.708  10576.696   1.302   0.1929
industryRETAIL TRADE                                        169.686  10464.912   0.016   0.9871
industryTRANSPORTATION AND WAREHOUSING                     5366.190  10587.090   0.507   0.6123
industryUTILITIES                                         13704.909  11388.756   1.203   0.2290
industryNot Working                                       -7337.938  10514.318  -0.698   0.4853
industryWHOLESALE TRADE                                    1569.235  10651.546   0.147   0.8829
maritalStatusNot Married                                  -2558.104    906.506  -2.822   0.0048

Some of the inferences based on this model:

  1. Looking at the p-values, gendermale is a statistically significant predictor of income, with a p-value below 0.001 (shown as 0.0000 in the table above).

  2. Males earn about 12128.72 more than females on average, holding the other variables in the model fixed.

  3. Family income also seems to be a statistically significant predictor of income, but we know that income and family income have a correlation of 0.41. This weakens the case for family income as a driver of income, as the association might not depict a causal effect.

  4. Marital status, as predicted in the hypothesis, is also significant in this model, with a p-value of 0.0048.
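
The t values and p-values behind these inferences can be checked by hand: each t value is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution on the residual degrees of freedom. The sketch below does this in R for two coefficients, using numbers copied from the table above. The residual degrees of freedom (2253) are taken from the model comparison output shown later in this report for the same formula; the variable names are illustrative.

```r
# Estimates and standard errors copied from the regression table above
estimate  <- c(gendermale = 12128.720, "maritalStatusNot Married" = -2558.104)
std_error <- c(gendermale = 900.815,   "maritalStatusNot Married" = 906.506)
resid_df  <- 2253  # assumption: residual df reported for this model in the comparison tables

# t value: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = resid_df)

print(round(t_value, 3))
print(round(p_value, 4))
```

The results should match the t value and Pr(>|t|) columns for gendermale and maritalStatusNot Married up to rounding.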

Check for interaction terms for the regression model with gender

Now that we have considered the effect of each variable separately, we can take into account the impact of the interaction between the final variables and gender. To understand the significance of the joint effect of gender and the other variables on income, we add one interaction term at a time and compare the linear regression models with and without that term. The p-value from the ANOVA test tells us whether the interaction term is significant and whether we should include it in our final regression model.
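
A minimal sketch of this kind of comparison in R is given here. It uses simulated data, not the NLSY97 sample, and a reduced set of predictors, so the numbers it prints are purely illustrative. The data frame `sim` and the object names are hypothetical; the column names mirror the ones used in this report.

```r
set.seed(1)
n <- 500

# Hypothetical simulated data (NOT the NLSY97 sample)
sim <- data.frame(
  gender        = factor(sample(c("female", "male"), n, replace = TRUE)),
  maritalStatus = factor(sample(c("Married", "Not Married"), n, replace = TRUE))
)
# Build in a gender gap that is larger for married respondents
gap <- ifelse(sim$maritalStatus == "Married", 15000, 3000)
sim$income <- 30000 + gap * (sim$gender == "male") + rnorm(n, sd = 10000)

# Model without and with the gender:maritalStatus interaction term
reduced <- lm(income ~ gender + maritalStatus, data = sim)
full    <- lm(income ~ gender + maritalStatus + gender:maritalStatus, data = sim)

# F test comparing the two nested models
comparison <- anova(reduced, full)
print(comparison)

# The same F statistic computed by hand from the residual sums of squares
rss    <- comparison$RSS
res_df <- comparison$Res.Df
f_by_hand <- ((rss[1] - rss[2]) / (res_df[1] - res_df[2])) / (rss[2] / res_df[2])
```

A small Pr(>F) in the second row means the interaction term explains a significant amount of additional variation, which is the criterion applied to each of the five variables in this section.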

1. Race
Analysis of Variance Table

Model 1: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus + gender:race
  Res.Df          RSS Df   Sum of Sq      F    Pr(>F)    
1   2253 727243144767                                    
2   2250 710098504020  3 17144640747 18.108 1.306e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the p-value, we can conclude that race has an impact on the income gap, because the comparison of the linear regression models with and without the interaction term is significant.

Inference : Race and Gender interaction term is significant with P-Value approx 0 (1.306e-11).

2. Number of Children
Analysis of Variance Table

Model 1: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus + gender:biologicalChild
  Res.Df          RSS Df   Sum of Sq      F   Pr(>F)    
1   2253 727243144767                                   
2   2252 711046366673  1 16196778093 51.298 1.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the p-value, we can conclude that the number of children at home has an impact on the income gap, because the comparison of the linear regression models with and without the interaction term is significant.

Inference : Number of Children and Gender interaction term is significant with P-Value approx 0 (1.07e-12).

3. Family Income
Analysis of Variance Table

Model 1: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus + gender:familyIncome
  Res.Df          RSS Df   Sum of Sq      F    Pr(>F)    
1   2253 727243144767                                    
2   2252 713875998042  1 13367146724 42.168 1.027e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the p-value, we can conclude that family income has an impact on the income gap, because the comparison of the linear regression models with and without the interaction term is significant.

Inference : Family Income and Gender interaction term is significant with P-Value approx 0 (1.027e-10).

4. Industry
Analysis of Variance Table

Model 1: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus + gender:industry
  Res.Df          RSS Df  Sum of Sq      F Pr(>F)
1   2253 727243144767                            
2   2238 720573077386 15 6670067381 1.3811 0.1473

Looking at the p-value, we find no evidence that industry changes the income gap once the other variables are in the model, because the comparison of the linear regression models with and without the interaction term is not significant.

Inference : Industry and Gender interaction term is not significant with P-Value approx 0.147.

5. Marital Status
Analysis of Variance Table

Model 1: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry + 
    maritalStatus + gender:maritalStatus
  Res.Df          RSS Df   Sum of Sq      F    Pr(>F)    
1   2253 727243144767                                    
2   2252 710131763595  1 17111381172 54.264 2.446e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the p-value, we can conclude that marital status has an impact on the income gap, because the comparison of the linear regression models with and without the interaction term is significant.

Inference : Marital Status and Gender interaction term is significant with P-Value approx 0 (2.446e-13).

Of the five interaction models we tested, only the Industry-Gender interaction term was not significant, which we can also observe from the graphs above, because many industries were not significantly related to the income gap in the given data set. So the interaction terms for 4 variables (Race, Marital Status, Number of Children and Family Income) qualify for inclusion, but not the one for Industry.

From all the given observations, I would like to concentrate on the effect of the respondent’s marital status on the income gap, so the final model includes only the interaction between gender and marital status. As the earlier analysis showed, married men earn more than unmarried men, while it is the opposite for women: unmarried women earn more than married women.

Final Regression Model

lm(income ~ race + industry + familyIncome + biologicalChild + gender * maritalStatus, data = nyse_top2removed)

                                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                9833.657  10425.291   0.943   0.3457
gendermale                                                15572.081   1005.597  15.485   0.0000
maritalStatusNot Married                                   2502.075   1128.999   2.216   0.0268
raceHispanic                                               2254.860   1046.248   2.155   0.0313
raceMixed                                                  5962.603   4147.703   1.438   0.1507
raceOther                                                  3626.406    941.448   3.852   0.0001
industryACTIVE DUTY MILITARY                              14662.186  11744.429   1.248   0.2120
industryAGRICULTURE, FORESTRY AND FISHERIES               -1185.624  11120.673  -0.107   0.9151
industryCONSTRUCTION                                       2400.648  10401.785   0.231   0.8175
industryEDUCATIONAL, HEALTH, AND SOCIAL SERVICES           3043.537  10313.822   0.295   0.7680
industryENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -5643.421  10351.263  -0.545   0.5857
industryFINANCE, INSURANCE, AND REAL ESTATE                8190.332  10408.173   0.787   0.4314
industryINFORMATION AND COMMUNICATION                      1560.670  10781.848   0.145   0.8849
industryMANUFACTURING                                      3214.942  10384.244   0.310   0.7569
industryMINING                                            19211.606  11508.302   1.669   0.0952
industryOTHER SERVICES                                    -3365.200  10410.715  -0.323   0.7465
industryPROFESSIONAL AND RELATED SERVICES                  -491.753  10356.516  -0.047   0.9621
industryPUBLIC ADMINISTRATION                             11928.053  10456.854   1.141   0.2541
industryRETAIL TRADE                                      -1449.321  10345.695  -0.140   0.8886
industryTRANSPORTATION AND WAREHOUSING                     3571.168  10466.955   0.341   0.7330
industryUTILITIES                                         11552.863  11260.263   1.026   0.3050
industryNot Working                                       -8780.117  10394.036  -0.845   0.3984
industryWHOLESALE TRADE                                    -691.998  10532.300  -0.066   0.9476
familyIncome                                                  0.208      0.011  19.029   0.0000
biologicalChild                                             162.269    381.177   0.426   0.6704
gendermale:maritalStatusNot Married                      -12482.300   1694.484  -7.366   0.0000

Looking at the p-values, gendermale is a statistically significant predictor of income, with a p-value below 0.001 (shown as 0.0000 in the table above).

SOME OF THE INTERESTING OBSERVATIONS IN THE FINAL REGRESSION MODEL

  1. Looking at the analysis, we can confirm with some confidence that males do earn more than females. Because married is the baseline, the gendermale coefficient says that married males earn 15572.081 more than married females on average.

  2. The income difference between unmarried men and unmarried women is $12482.3 less than the difference between married men and married women. Marital status is a factor that is highly associated with the income gap between men and women.

  3. Number of children at home is not a significant predictor in the final model, contrary to what was assumed in the hypothesis, as the p-value for this variable (0.6704) is greater than 0.05.

  4. The estimated income gap between unmarried men and unmarried women is equal to the sum of the “gendermale” coefficient and the “gendermale:maritalStatusNot Married” coefficient

    = 15572.081 + (-12482.300) = 3089.781, or about 3090

  5. Since married is the baseline for the regression model, the estimated income gap between men and women for married respondents = “gendermale” coefficient + 0

    = 15572
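
The arithmetic behind observations 4 and 5 can be written out as a short R sketch. The two coefficients are copied from the final regression table; the variable names are illustrative.

```r
# Coefficients copied from the final regression table
coefs <- c(gendermale = 15572.081,
           "gendermale:maritalStatusNot Married" = -12482.300)

# Married is the baseline, so the gap is the gendermale coefficient alone
gap_married <- coefs[["gendermale"]]

# For unmarried respondents the interaction coefficient is added on top
gap_unmarried <- coefs[["gendermale"]] +
  coefs[["gendermale:maritalStatusNot Married"]]

print(round(c(married = gap_married, unmarried = gap_unmarried)))
```

The negative interaction coefficient is what makes the estimated gap among unmarried respondents so much smaller than the gap among married respondents.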

These four plots are important diagnostic tools in assessing whether the linear model is appropriate. The first two plots are the most important.

Residuals vs. Fitted
We see that there’s a clear “funneling” phenomenon. The residuals are well concentrated around 0 for small fitted values, but they get more and more spread out as the fitted values increase. This is an instance of “increasing variance”. The standard linear regression assumption is that the variance is constant across the entire range. As this assumption does not hold here, the plot is a clear indicator that the given linear model is not fully appropriate.

Normal QQ plot: Residuals deviate from the diagonal line in both the upper and lower tail. The tails are observed to be ‘heavier’ (have larger values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a “steeper” line than the diagonal. For the p-values to be believable, the residuals from the regression must look approximately normally distributed. Since they deviate from normality in the tails, the p-values should be read with some caution.

Scale-location plot: This is another version of the residuals vs fitted plot. Just as the first plot shows the funneling phenomenon, here the red line should be horizontal but is tilted, suggesting the model is not completely accurate.

Residuals vs Leverage: The data is positively skewed from the beginning. Even after removing the top coded values from the data set, income is still positively skewed. For this reason, we have outliers. Points with a high residual (poorly described by the model) and high leverage (high influence on the model fit) are influential outliers. They skew the model fit away from the rest of the data.
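
For readers who want to see how such diagnostic plots arise, the sketch below fits a linear model to simulated data whose noise grows with the predictor and then draws the four standard diagnostic plots with base R. The data is hypothetical and unrelated to the NLSY97 sample, and the names `x`, `y`, `fit` and `spread_trend` are illustrative.

```r
set.seed(42)
n <- 300
x <- runif(n, 1, 10)

# Noise standard deviation grows with x, which produces the funnel shape
y <- 1000 * x + rnorm(n, sd = 300 * x)
fit <- lm(y ~ x)

pdf(NULL)             # null device so the sketch also runs without a display
par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
invisible(dev.off())

# A rough numeric counterpart of the funnel: the size of the residuals
# increases with the fitted values
spread_trend <- cor(abs(resid(fit)), fitted(fit))
print(spread_trend)
```

Dropping the `pdf(NULL)` and `dev.off()` lines in an interactive session displays the plots on screen. A clearly positive `spread_trend` corresponds to the widening spread seen in the Residuals vs. Fitted and Scale-location plots.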

CONCLUSION

Of the many variables hypothesised, only Race, Marital Status and Industry have a major impact on the income gap. Family income, even though it tests as significant, cannot be taken at face value because income and family income are correlated. The relation between income and family income is more of a consequence than a causal effect.

From this study, we can say with some confidence that there is a significant difference in income between men and women. But this difference also depends on variables such as Race, Marital Status and Industry. Marital Status especially interests me, as the effect on men and women was large.

I believe this gender pay gap is also prevalent due to various factors such as culture, social structure and historical factors which cannot be quantified. Some of these reasons are reflected by factors such as race and marital status, which were studied in this analysis.

The data given by NLSY was dirty and incomplete. The data had many top coded values which, if not accounted for, would have reduced our data set considerably. To use the given data set for making inferences, we made many assumptions and imputed multiple values. The changes made to the data set and the assumptions made before the analysis will influence our final results.

The assumptions made during our analysis:

  1. Independence - The data was collected over a long period of time. To perform linear regression, we have to assume that the observations are independent. If this assumption fails, our analysis might be inaccurate.

  2. Normality - The data is not completely normally distributed, especially at the edges. The data set is skewed, as we can see from the Q-Q plot. This linear model may not be a good reflection of very low and very high income respondents.

  3. Imputations - Due to so many incomplete values and coded values from -1 to -5, many assumptions were made in order to impute values. These imputations will have an impact on the results produced in the end.

  4. Top Coded Values - The top coded values were removed from the data set because, as we observed from the Q-Q plot, the top coded income values were making the data set skewed, which would influence our results.

These assumptions have influenced the data and may have led to a biased linear regression model. The potential limitations of my analysis are as follows.

As the data was not clean and not all variables were taken into consideration while building the regression model, I do not have a high level of confidence in my model. The data used for building the linear regression model did not fully satisfy the normality assumption of linear regression, making this model less reliable. The conclusions of my analysis can be considered a possible reflection of the real world scenario but cannot be generalised to a very large scale. Policy makers can use the study to understand possible trends, but should not make important decisions on the basis of this study alone.